A novel optimized tiny YOLOv3 algorithm for the identification of objects in the lawn environment

To address the insufficient accuracy of the original tiny YOLOv3 algorithm for object detection in a lawn environment, an Optimized tiny YOLOv3 algorithm with less computation and higher accuracy is proposed. Three factors limit the accuracy of the original tiny YOLOv3 algorithm for detecting objects in a lawn environment. First, the backbone of the original algorithm is a stack of single convolutional layers and max-pooling layers, which results in insufficient ability to extract feature information of objects. An enhancement module is proposed to enhance the feature extraction capability of the shallow layers of the network. Second, the information of the shallow convolutional layers of the backbone is not fully used, which results in insufficient detection capability for small objects. Third, the deep part of the backbone uses a convolutional layer with an excessive number of channels, which results in a large amount of computation. A multi-resolution fusion module is proposed to enhance the information interaction capability of the deep and shallow layers of the network, and to reduce the computation. To verify the accuracy of the Optimized tiny YOLOv3 algorithm, it was tested on a dataset containing trunk, spherical tree and person, and compared with current research. The results show that the algorithm proposed in this paper improves the detection accuracy while reducing the amount of calculation.

layers to the tiny YOLOv3 network. Gai 17 proposed the improved tiny YOLOv3 for real-time object detection by adding convolutional layers. He 18 proposed the TF-YOLO to increase the detection accuracy of the tiny YOLOv3 network by adding one YOLO layer. Wu 19 proposed a light YOLOv3 network to detect apples using a residual block composed of depthwise separable convolutions. Liu 20 increased the detection accuracy of the tiny YOLOv3 network by adding one YOLO layer.
The original tiny YOLOv3 algorithm has low detection accuracy on our lawn environment object dataset. To address this: (1) we design an enhancement module to improve the detection accuracy of the backbone; (2) we design a multi-resolution fusion module to enhance the information interaction capability inside the backbone and reduce the amount of calculation; (3) on this basis, the Optimized tiny YOLOv3 algorithm is proposed. A comparison experiment with three current lightweight YOLO algorithms shows that the algorithm proposed in this paper is superior to the others in terms of accuracy and lightweight degree.

Optimized tiny YOLOv3 algorithm network
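Loss function. To improve the speed and convergence of the bounding box regression, the CIoU 21 (complete intersection over union) loss function is adopted as the loss function,

$$L_{CIoU} = 1 - IoU + R_{CIoU}$$

where $R_{CIoU}$ is the penalty term,

$$R_{CIoU} = \frac{\rho^{2}(b, b^{gt})}{f^{2}} + \alpha v$$

$$\alpha = \frac{v}{(1 - IoU) + v}$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}$$

where $\alpha$ is the weight function; $v$ is the similarity parameter of the aspect ratio; $w$, $h$ and $w^{gt}$, $h^{gt}$ are the widths and heights of the bounding box and the ground truth, respectively; $b$ and $b^{gt}$ represent the center points of the bounding box and the ground truth, respectively; $\rho$ is the Euclidean distance between the two center points; $f$ is the diagonal length of the smallest rectangle that can contain both the bounding box and the ground truth, as shown in Fig. 1.

The amount of calculation of a convolutional layer is given by Eq. (5), where $H$, $W$ and $C_{out}$ are the height, width and number of channels of the output feature map, respectively; $C_{in}$ is the number of channels of the input feature map; and $K_h$ and $K_w$ are the sizes of the convolution kernel.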
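As a minimal sketch of the CIoU loss described above, the following Python function evaluates it for a single pair of boxes. The (cx, cy, w, h) box format, the function name `ciou_loss` and the guard against division by zero are assumptions made for illustration; they are not taken from the paper or from the Darknet implementation.

```python
import math

def ciou_loss(box, gt):
    """CIoU loss for two axis-aligned boxes given as (cx, cy, w, h), all sizes > 0."""
    (x, y, w, h), (xg, yg, wg, hg) = box, gt
    # Corner coordinates of both boxes
    x1, y1, x2, y2 = x - w / 2, y - h / 2, x + w / 2, y + h / 2
    xg1, yg1, xg2, yg2 = xg - wg / 2, yg - hg / 2, xg + wg / 2, yg + hg / 2
    # Intersection over union
    inter_w = max(0.0, min(x2, xg2) - max(x1, xg1))
    inter_h = max(0.0, min(y2, yg2) - max(y1, yg1))
    inter = inter_w * inter_h
    iou = inter / (w * h + wg * hg - inter)
    # Squared distance between the center points (rho^2)
    rho2 = (x - xg) ** 2 + (y - yg) ** 2
    # Squared diagonal of the smallest enclosing rectangle (f^2)
    f2 = (max(x2, xg2) - min(x1, xg1)) ** 2 + (max(y2, yg2) - min(y1, yg1)) ** 2
    # Aspect-ratio similarity v and its weight alpha
    v = 4 / math.pi ** 2 * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v) if v > 0 else 0.0  # guard: identical aspect ratios
    return 1 - iou + rho2 / f2 + alpha * v

# Two disjoint boxes with equal aspect ratio: IoU = 0, v = 0, rho^2 / f^2 = 16 / 40
print(ciou_loss((0, 0, 2, 2), (4, 0, 2, 2)))
```

Unlike a plain IoU loss, the center-distance term still changes when the boxes do not overlap, which is the property that helps the regression converge.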
www.nature.com/scientificreports/

To solve the problem of insufficient utilization of the shallow feature information of the network, the backbone is improved, and an enhancement module is proposed to strengthen feature extraction, as shown in Fig. 2.
The convolutional layers of 3 × 3 and 1 × 1 are used to enhance feature extraction and fusion for the feature map output from the 6th layer of the network, instead of using the max pooling to reduce the dimension. The point convolution is used for cross-channel fusion and the number of channels is compressed from 128 to 64, which reduces the amount of computation. A 3 × 3 convolutional layer is used for feature extraction, and the number of output channels is still 64. Finally, the point convolution is used to expand the number of channels to 128. Different from the 7th layer in the original backbone that uses a max-pooling layer to reduce the size, a 3 × 3 convolutional layer with stride 2 is used to extract feature information while reducing the dimension. A point convolutional layer is used to compress the channel of the feature map to 64, and then expand it to 128, compress the invalid feature map with little calculation, and generate effective feature maps. A 3 × 3 convolutional layer is used to extract feature maps, resulting in the same output as the original network. It can be seen from Eq. (5) that the calculation amount of the enhancement module is 0.9BFLOPs, which is only 0.1BFLOPs increased compared to the original tiny YOLOv3.
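The 0.9 BFLOPs figure can be checked with a rough per-layer count. The sketch below rests on several assumptions that the text does not state: the Darknet-style count of 2 × H × W × C_out × K_h × K_w × C_in operations per convolutional layer (bias ignored, which may differ from the exact form of Eq. (5)), a 416 × 416 input so that the 6th-layer output is 52 × 52, 128 output channels for the stride-2 convolution, and 256 output channels for the final 3 × 3 convolution. The helper name `conv_flops` is hypothetical.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """Operations of one k x k convolution, counting a multiply-add as 2 (assumed convention)."""
    return 2 * h_out * w_out * c_out * k * k * c_in

# (output height, output width, input channels, output channels, kernel size)
enhancement_layers = [
    (52, 52, 128, 64, 1),   # point convolution compresses 128 -> 64
    (52, 52, 64, 64, 3),    # 3 x 3 feature extraction, still 64 channels
    (52, 52, 64, 128, 1),   # point convolution expands 64 -> 128
    (26, 26, 128, 128, 3),  # 3 x 3, stride 2 (128 output channels assumed)
    (26, 26, 128, 64, 1),   # point convolution compresses to 64
    (26, 26, 64, 128, 1),   # point convolution expands to 128
    (26, 26, 128, 256, 3),  # 3 x 3 extraction (256 output channels assumed)
]

total = sum(conv_flops(*layer) for layer in enhancement_layers)
print(f"{total / 1e9:.2f} BFLOPs")  # about 0.91 under these assumptions
```

Under these assumptions the count lands close to the reported 0.9 BFLOPs, with the two 3 × 3 convolutions at 52 × 52 and 26 × 26 and the final 3 × 3 convolution accounting for most of it.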

Multi-resolution fusion module.
The max-pooling layer is widely used by the backbone of the original tiny YOLOv3, resulting in the loss of a large amount of semantic information during the downsampling process and the missed detection of small objects. In order to solve the above problems and reduce the increased amount of calculation in the enhancement module, a multi-resolution fusion module is proposed.
The last layer in the original Tiny YOLOv3 backbone is at the top of the FPN 22 (feature pyramid networks) network. The resolution of the output feature map is 13 × 13 and the number of channels is 1024, which requires a huge amount of calculation. A multi-resolution fusion module is used to replace this layer. Since the backbone contains few convolutional layers, only making full use of the convolutional feature maps of the shallow layers of the network can improve the detection accuracy. The multi-resolution fusion module is shown in Fig. 3.
In the first part, a convolutional layer of size 3 × 3 and stride 2 is used to extract and reduce the dimension of the feature map output by the 13th layer of the backbone after adding the enhancement module. The second part is the feature map of the 16th layer output of the backbone after adding the enhancement module. In the third part, a convolutional layer of size 3 × 3 and stride 2 is used to extract and reduce the dimension of the feature map output by the 10th layer of the backbone after adding the enhancement module. Finally, the three parts are spliced with concat to form a feature map of 13 × 13 × 896.
It can be seen from Eq. (5) that the calculation amount of the last layer of the original tiny YOLOv3 is 1.60BFLOPs, and the calculation amount of the multi-resolution fusion module is 0.25BFLOPs. It can be seen that the multi-resolution fusion module not only further utilizes the information of the shallow layers of the backbone, but also reduces the amount of calculation by 1.35 BFLOPs.
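The figures in this comparison can be reproduced approximately with a short count. This is a sketch under stated assumptions, not the authors' calculation: it uses the Darknet-style count of 2 × H × W × C_out × K_h × K_w × C_in operations per layer, takes the 512 input channels of the last backbone layer from the standard tiny YOLOv3 configuration, and assumes that the two stride-2 convolutions keep their channel counts (256 for the 13th-layer branch, 128 for the 10th-layer branch) and that the 16th-layer output has 512 channels. The helper name `conv_flops` is hypothetical.

```python
def conv_flops(h_out, w_out, c_in, c_out, k):
    """Operations of one k x k convolution, counting a multiply-add as 2 (assumed convention)."""
    return 2 * h_out * w_out * c_out * k * k * c_in

# Last backbone layer of the original tiny YOLOv3: 3 x 3 convolution, 512 -> 1024 at 13 x 13
original_last = conv_flops(13, 13, 512, 1024, 3)

# Fusion module: two stride-2 3 x 3 convolutions with 13 x 13 outputs (channel counts assumed)
branch_13th = conv_flops(13, 13, 256, 256, 3)
branch_10th = conv_flops(13, 13, 128, 128, 3)
fusion = branch_13th + branch_10th

# The 16th-layer output (512 channels assumed) is concatenated without extra convolution
fused_channels = 256 + 512 + 128
saving = original_last - fusion

print(fused_channels)                          # 896
print(f"{original_last / 1e9:.2f} BFLOPs")     # about 1.59
print(f"{fusion / 1e9:.2f} BFLOPs")            # about 0.25
print(f"{saving / 1e9:.2f} BFLOPs saved")      # about 1.35
```

With these assumed channel counts the concatenated map has 896 channels, and the counts agree with the reported values to within rounding.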
Network structure. The whole Optimized network is shown in Fig. 4.
The total calculation amount of the original tiny YOLOv3 is 5.45BFLOPs, and the total calculation amount of Optimized tiny YOLOv3 using enhancement module and multi-resolution fusion module is 5.25BFLOPs, a reduction of 3.7%.

Materials and methods
Dataset. A lawn environment object dataset was created to train the algorithm. The static obstacles in the lawn environment mainly comprise trunks and spherical trees, whereas the dynamic obstacles mainly comprise people. No publicly available dataset meets the requirements of this paper, so a dataset had to be created to verify the optimized network. The dataset has three main categories, namely, trunk, spherical tree, and person. The trunk and spherical tree pictures were taken in the field in the lawn environment, a total of 8059, including 7922 trunk samples and 567 spherical tree samples, and the picture size was 564 × 422 × 3. Trunk and spherical tree are shown in Fig. 5. The person pictures are all the pictures in PASCAL VOC 2007 that cover the person category, 4012 pictures in total. The entire dataset has 12,071 pictures. To accelerate the network convergence speed and prevent gradient explosion, the labeled data are normalized, and the format of the normalized labeled data is

class_id x y w h

where class_id is the object category (trunk is 0, spherical tree is 1, and person is 2), x and y are the normalized coordinates of the center point of the object bounding box, and w and h denote its normalized width and height, respectively:

$$x = \frac{x_{max} + x_{min}}{2u}, \quad y = \frac{y_{max} + y_{min}}{2v}, \quad w = \frac{x_{max} - x_{min}}{u}, \quad h = \frac{y_{max} - y_{min}}{v}$$

where $x_{min}$ and $y_{min}$ are the coordinates of the upper left corner of the object bounding box, $x_{max}$ and $y_{max}$ are the coordinates of the lower right corner of the object bounding box, and $u$ and $v$ are the width and height of the picture, respectively, as shown in Fig. 6.
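The label normalization can be sketched in a few lines of Python. The sketch assumes the usual YOLO convention, x = (x_max + x_min)/2u, y = (y_max + y_min)/2v, w = (x_max − x_min)/u and h = (y_max − y_min)/v; the function name `to_yolo_label` and the six-decimal output format are hypothetical choices for illustration.

```python
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, u, v):
    """Convert a pixel-space corner box into a normalized 'class_id x y w h' line.

    u and v are the picture width and height in pixels.
    """
    x = (x_max + x_min) / (2 * u)  # normalized center x
    y = (y_max + y_min) / (2 * v)  # normalized center y
    w = (x_max - x_min) / u        # normalized box width
    h = (y_max - y_min) / v        # normalized box height
    return f"{class_id} {x:.6f} {y:.6f} {w:.6f} {h:.6f}"

# A trunk (class 0) covering the central half of a 564 x 422 picture
print(to_yolo_label(0, 141, 105.5, 423, 316.5, 564, 422))
# prints "0 0.500000 0.500000 0.500000 0.500000"
```

Because every value is divided by the picture size, all four coordinates fall in the range 0 to 1 regardless of the picture resolution.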
The dataset was randomly divided, and the numbers of images in the training set, validation set and testing set are 7727, 1932 and 2412, respectively. The anchor boxes are clustered using the k-means++ algorithm, an extension of k-means 23.

Algorithm training. In terms of the network training platform, the operating system is Windows 10, the CPU is an Intel i7-9700KF with a 3.6 GHz clock speed, the memory size is 16 GB, the GPU is an NVIDIA GeForce RTX 2070 Super with 8 GB of memory, the deep learning framework is AlexeyAB-Darknet, and the compilation environment is Visual Studio 2015 with the C/C++ language.
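The anchor clustering step mentioned above can be illustrated with a small pure-Python sketch of k-means++ on (width, height) pairs. The text only names the algorithm, so the 1 − IoU distance (a common choice for YOLO anchors), the function names, the number of clusters and the toy box sizes below are all assumptions for illustration.

```python
import random

def iou_wh(a, b):
    """IoU of two boxes aligned at one corner, each given as (w, h)."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeanspp_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors with k-means++ seeding and 1 - IoU distance."""
    rng = random.Random(seed)
    # k-means++ seeding: first centre uniformly, later ones with probability ~ distance^2
    centres = [rng.choice(boxes)]
    while len(centres) < k:
        d2 = [min(1 - iou_wh(b, c) for c in centres) ** 2 for b in boxes]
        centres.append(rng.choices(boxes, weights=d2, k=1)[0])
    # Standard k-means iterations: assign to the centre with highest IoU, then average
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centres[i]))
            groups[best].append(b)
        new = [
            (sum(b[0] for b in g) / len(g), sum(b[1] for b in g) / len(g)) if g else c
            for g, c in zip(groups, centres)
        ]
        if new == centres:
            break
        centres = new
    return sorted(centres, key=lambda c: c[0] * c[1])

# Toy example: three small and three large boxes (sizes in pixels, made up)
toy_boxes = [(10, 20), (12, 22), (11, 21), (100, 50), (104, 52), (96, 48)]
anchors = kmeanspp_anchors(toy_boxes, k=2)
print(anchors)
```

The seeding step spreads the initial centres apart, which is the difference from plain k-means; the sketch expects at least k distinct box sizes in the input.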
The total number of training iterations of the Optimized tiny YOLOv3 algorithm was 25,000, the initial learning rate was set to 0.00261, and the learning rate was reduced to 10% when training reached 15,000 and 25,000 iterations. The decay value was set to 0.0005. During the training process, the images were rotated and the hue and saturation were changed to prevent overfitting. The loss function is CIoU, and the type of non-maximum suppression is greedy-NMS (greedy non-maximum suppression). The Optimized tiny YOLOv3 algorithm is compared with the original tiny YOLOv3, Improved tiny YOLOv3 and TF-YOLO in terms of loss and mAP (mean average precision) on the lawn environment object dataset. The curves of the loss function value against the number of training iterations for each algorithm are shown in Fig. 7. The red line in the figure marks a loss function value of 0.5; an algorithm whose loss function value falls below 0.5 can be considered to have good detection performance.
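For readers unfamiliar with greedy NMS, the following is a minimal Python sketch of the idea, not the Darknet implementation. The corner box format, the function names and the 0.45 IoU threshold are assumptions for illustration; the text does not give the threshold used.

```python
def iou(a, b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_nms(boxes, scores, iou_threshold=0.45):
    """Return the indices of the boxes kept by greedy non-maximum suppression."""
    # Visit boxes from highest to lowest confidence
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two overlapping detections of one object and one separate detection
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores))  # prints [0, 2]
```

In practice the suppression is applied per class, so a person box does not suppress an overlapping trunk box.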
The higher the relative height of the red line from the bottom horizontal axis, the better the convergence of the algorithm. In Fig. 7, TF-YOLO has the highest convergence, followed by Optimized tiny YOLOv3. Compared with original tiny YOLOv3, the algorithm proposed in this paper has a significant improvement in convergence.
During the training process, the mAP curve of each algorithm is shown in Fig. 8. Optimized tiny YOLOv3 has the highest mAP value on the validation set compared with all other algorithms, and has the best training effect on the training set.

Results and discussion
The AP (average precision) of each class and the mAP of all classes are used to evaluate the accuracy of algorithms, and the amount of calculation is used to evaluate the lightweight degree. The four algorithms involved in training are tested on the testing set, and the AP of each class and the overall mAP are counted, as shown in Fig. 9. It can be seen from Fig. 9 that the detection accuracy of the Optimized tiny YOLOv3 algorithm proposed in this paper for trunk, spherical tree and person is improved by 8.03%, 7.04% and 8.34% respectively compared with the original tiny YOLOv3. The mAP value has improved significantly, increasing by 7.8%. The data tested on the testing set of the algorithm proposed in this paper show that it has the best performance in terms of detection accuracy.
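Since mAP is the mean of the per-class AP values, the mAP gain should equal the mean of the three per-class gains. The short check below uses only the numbers quoted above; the variable names are illustrative.

```python
from statistics import mean

# Per-class AP improvements over the original tiny YOLOv3, in percentage points
ap_gain = {"trunk": 8.03, "spherical tree": 7.04, "person": 8.34}

map_gain = mean(ap_gain.values())
print(f"{map_gain:.2f}")  # prints 7.80
```

The result agrees with the reported mAP increase of 7.8%.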
The calculation amount of each algorithm is shown in Fig. 10. It can be seen from Fig. 10 that the Optimized tiny YOLOv3 algorithm proposed in this paper has the smallest amount of computation. Compared with the original tiny YOLOv3, the amount of calculation is reduced by 0.2 BFLOPs. The calculation amount of Optimized tiny YOLOv3 is much smaller than that of the TF-YOLO algorithm.

Compared with the original tiny YOLOv3 algorithm, the Optimized tiny YOLOv3 algorithm proposed in this paper greatly improves the detection accuracy while also slightly improving the lightweight degree, and its accuracy and lightweight degree are better than those of Improved tiny YOLOv3 and TF-YOLO.

Conclusions
In this work, we explore the application of deep learning-based object detection technology in the lawn environment, providing a research example for agricultural intelligence.
A dataset containing trunk, spherical tree and person is specially made for the lawn environment, which provides dataset support for subsequent neural network-based target detection algorithms.
To address the insufficient feature extraction capability of the backbone of the original tiny YOLOv3 algorithm, an enhancement module is proposed. To address the poor detection of small objects by the backbone, a multi-resolution fusion module is proposed to strengthen the information interaction between the deep and shallow convolutional layers.

Based on the enhancement module and the multi-resolution fusion module, the Optimized tiny YOLOv3 algorithm is proposed. Experiments on the dataset show that the algorithm proposed in this paper not only reduces the amount of calculation but also greatly improves the detection accuracy.

Data availability
All data generated or analysed during this study are included in this published article.